An Ultra-low Power Memory with a
Subthreshold  Power  Supply  Voltage 


Jinhui  Chen,  Lawrence  T.  Clark,  Senior  Member,  IEEE,  and  Tai-Hua  Chen 

Abstract — A 512 x 13b ultra-low power subthreshold memory is fabricated on a 130-nm process
technology.  The  fabricated  memory  is  fully  functional  for  read  operation  with  a  190  mV  power 
supply  at  28  kHz,  and  216  mV  for  write  operation.  Single  bits  are  measured  to  read  and  write 
properly  with  Vdd  as  low  as  103  mV  and  129  mV,  respectively.  The  memory  operates  at  a  1 
MHz clock rate with a 310 mV power supply. This operating point has 1.197 μW power
consumption, of which 0.366 μW is due to leakage and 0.831 μW is due to dynamic power
dissipation.  Analysis  of  the  available  fan-out  or  fan-in  that  can  be  supported  at  a  given  voltage  is 
summarized.  A  number  of  circuit  techniques  are  presented  to  overcome  the  substantially  reduced 
on-to-off  current  ratios  and  the  poor  drive  strength  of  transistors  operating  in  subthreshold.  These 
include  a  gated  feedback  memory  cell,  and  hierarchical  read  and  decode  circuits.  The  memory  is 
dynamic,  with  pseudo-static  operation  provided  via  self-timed  control  of  the  keeper  transistors  to 
mitigate  increased  variability  manifested  in  subthreshold  operation. 

Index  Terms —  High  fan-in/out,  On-to-off  current  ratio,  Subthreshold  memory,  Ultra-low 
power.

I. INTRODUCTION

Reducing  the  operating  voltage  is  the  most  effective  way  to  reduce  integrated  circuit  power 
consumption.  Recently,  interest  in  operating  CMOS  circuits  with  power  supply  voltages  below 
the  transistor  threshold  voltage  has  been  increasing  [1-4].  Research  into  the  limits  of  this 
technique, i.e., how low voltage can be effectively scaled, dates back to the early 1970s [5].
Besides the quadratic dynamic power savings, very low supply voltages promise greatly reduced
leakage power [6-7]. For example, a 1 V reduction in the supply voltage can reduce the transistor
leakage current, Ioff, by over one decade due to the drain-induced-barrier-lowering (DIBL)
effect. The gate oxide leakage can also be reduced by more than 100x with the same 1 V drop in
supply  voltage  [8-9].  Consequently,  subthreshold  circuits  can  allow  ultra-low  power  designs  to 
be  fabricated  on  modern  processes. 

Subthreshold  operation  is  applicable  to  a  wide  range  of  applications,  ranging  from  wireless 
“motes”  [10],  wristwatch  computation  [11],  to  biomedical  applications  such  as  hearing  aids  and 
pacemakers  [12-13],  as  well  as  spacecraft  applications  [14].  Today,  battery  powered  hand-held 
systems  have  been  proliferating  faster  than  other  integrated  circuit  applications.  Examples  such 
as  cell  phones,  MP3  players,  and  portable  games,  abound.  All  benefit  from  increased  battery  life, 
which  lowering  integrated  circuit  (IC)  power  dissipation  can  provide.  In  modern  system-on-chip 
(SOC) devices, some components, such as digital signal processors and microprocessors, must
operate  at  high  frequencies,  at  least  intermittently.  Many  components  do  not  need  to  run  as  fast, 
but  must  be  integrated  on  the  same  high-performance  silicon  die.  Additionally,  some  applications 
require  a  small  subset  of  the  circuits  to  operate  continuously,  e.g.,  real-time  clocks  and  wakeup 
circuitry.  Operating  a  subset  of  circuits  with  subthreshold  supply  voltages  may  offer  the  best 
solution to otherwise difficult power vs. performance compromises in process selection.

Despite the power advantages of subthreshold operation, significant limitations exist. Firstly,
circuit  speed  is  diminished  due  to  the  low  transistor  drive  current.  This  suggests  reduced  gates 
per  pipeline  stage  to  provide  as  much  speed  as  possible.  Secondly,  high  fan-in/out  circuits 
present  difficult  circuit  design  challenges  due  to  the  reduced  on-to-off  current  ratio,  Ion/Ioff  [15]. 
Memory  read  bit  lines  (RBL’s)  and  write  bit  lines  (WBL’s)  are  examples  of  circuits  with  high 
fan-in  and  fan-out,  respectively.  In  this  paper,  we  refer  to  fan-in  and  fan-out  by  the  amount  of 
transistor  diffusion  loading,  which  creates  leakage  paths  that  diminish  the  circuit  Ion/Ioff  ratio, 
rather  than  the  conventional  capacitive  fan-out.  The  reduced  Ion/Ioff  ratio  in  subthreshold 
operation  results  in  vanishing  noise  margins,  eventually  leading  to  circuit  failure,  and  is  the 
primary  limiter  for  subthreshold  memory  design. 

A.  Reduced  Ion/Ioff  Ratio  in  Subthreshold  Operation 
MOS transistors operate as transconductors. The current on a 130 nm process technology is
shown in Fig. 1. The leakage current Ioff is conventionally defined at VGS = 0 and VDS = VDD for
an NMOS transistor. When VDD is above threshold, Ion/Ioff is nearly constant over a wide
range of VDD. However, IDS drops exponentially with VGS and follows the subthreshold slope
factor S when VGS < Vt. The drain current IDS is dominated by diffusion current for subthreshold
operation. IDS varies exponentially with the controlling gate input voltage VGS and with the
supply voltage applied as VDS, according to [16-17]

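$$ I_{DS} = I_0 \left[ 1 - \exp\left( -\frac{V_{DS}}{V_{th}} \right) \right] \exp\left( \frac{V_{GS} - V_t}{n V_{th}} \right) \tag{1} $$

where

$$ n = 1 + \frac{C_{depl}}{C_{ox}} \tag{2} $$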
Cox and Cdepl are gate oxide capacitance and depletion capacitance, respectively. Vth is the
thermal voltage and μ is the electron mobility. The technology constant n characterizes the
process subthreshold swing, and is usually between one and two [18]. As VDD is scaled below Vt,
the transistor Ion is diminished exponentially, approaching the transistor Ioff. In fact, in
subthreshold the Ion is just Ioff multiplied by 10^(VDD/S), as evident from Fig. 1. Hence, the ratio
Ion/Ioff may be many orders of magnitude lower than for above threshold operation.
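
As a rough numerical illustration of this exponential dependence, the following Python sketch evaluates the standard subthreshold drain-current expression, I_DS = I_0 [1 - exp(-V_DS/V_th)] exp((V_GS - V_t)/(n V_th)), and the resulting Ion/Ioff ratio. The values used for V_t, n, and I_0 are placeholders chosen for illustration; they are assumptions and are not extracted from the 130 nm process used here.

```python
import math

V_TH = 0.02585  # thermal voltage kT/q near room temperature, in volts


def subthreshold_current(v_gs, v_ds, v_t=0.35, n=1.5, i_0=1e-7):
    """Ideal subthreshold drain current in amperes.

    v_t, n and i_0 are illustrative placeholders, not measured process data.
    """
    drain_term = 1.0 - math.exp(-v_ds / V_TH)
    gate_term = math.exp((v_gs - v_t) / (n * V_TH))
    return i_0 * drain_term * gate_term


def swing_mv_per_decade(n=1.5):
    """Subthreshold swing S = n * V_th * ln(10), in mV per decade of current."""
    return n * V_TH * math.log(10.0) * 1e3


def on_off_ratio(v_dd, **device):
    """Ion (V_GS = V_DS = VDD) divided by Ioff (V_GS = 0, V_DS = VDD)."""
    i_on = subthreshold_current(v_dd, v_dd, **device)
    i_off = subthreshold_current(0.0, v_dd, **device)
    return i_on / i_off


if __name__ == "__main__":
    for v_dd in (0.10, 0.20, 0.30):
        print(f"VDD = {v_dd:.2f} V  Ion/Ioff = {on_off_ratio(v_dd):.1f}")
```

Because the drain term is common to Ion and Ioff, the ratio in this idealized model reduces to 10^(VDD/S) and does not depend on the placeholder values of V_t or I_0.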

B.  Challenges  in  Subthreshold  Memory  Design 

Logic  gates,  such  as  NAND,  NOR,  or  other  combinational  logic  gates,  work  normally  in 
subthreshold  except  that  their  propagation  delays  increase  exponentially  with  the  reduction  of 
supply voltage. However, subthreshold memory design is problematic due to the reduced Ion/Ioff
[15].  It  is  well  known  that  a  traditional  static  random  access  memory  (SRAM)  cell  has  vanishing 
read  stability  at  low  voltage  since  the  read  current  can  raise  the  logic  low  storage  voltage  to  the 
trip-point  of  the  cell  [19].  This  can  be  alleviated  by  the  use  of  a  register  file  type  of  design,  which 
provides  a  read  current  path  independent  of  the  storage  node.  Adding  two  transistors  to  keep  the 
read  current  from  affecting  the  cell  value,  such  as  in  the  conventional  register  file  read  out  circuit, 
increases the read stability at low VDD dramatically compared to a traditional SRAM cell.
Simulation results for both a traditional six-transistor SRAM and a register file (RF) cell are
shown in Fig. 2 with VDD = 200 mV. An RF cell maintains static noise margin (SNM) in the
storage cell when reading, as evident in the figure. Therefore, in the design described here, we
use an RF type of cell rather than an SRAM cell. Of course, a larger cell area (due to more
transistors) is required to obtain this better SNM at lower operating voltage. The goal here is to
design a reasonably dense memory with the lowest possible operating voltage (Vmin).
Consequently, the highest achievable density is not the primary objective and is explicitly traded
off to obtain a lower Vmin.

Besides SNM, there are additional difficulties for subthreshold memory design. The high
fan-in/out of the WBL and RBL, even for relatively small numbers of cells, e.g., 16 or 32, can result
in circuit failure. The key for design is determination of when the circuit fails as VDD scales
down, i.e., what is the minimum operation voltage, Vmin, at a given fan-in/out. Alternatively,
determination of the maximum fan-in/out at a given operation voltage can be used to guide the
memory design.

A latch with no gating transistor in the feedback path, i.e., a jam latch, is generally used for the
storage in RF cells to minimize the cell size [20-23]. This can reduce the write-ability of the cell
at low voltages, since the driving transistor Ion can be comparable to the feedback transistor Ion
at process corners. Variability effects are larger when operating in subthreshold, so these effects
must be comprehended. It is also important to avoid circuit races, which are more likely to fail in
subthreshold due to increased circuit variability at process corners and due to random variation.

II.  THEORETICAL  BASIS 

A.  Failure  of  High  Fan-in/out  Circuits  in  Subthreshold 

As  mentioned,  the  low  Ion/Ioff  ratio  presents  difficulties  primarily  for  circuit  nodes  that  have  a 
very  high  or  low  P  to  N  ratio.  The  most  common  cases  occur  on  the  memory  bit-lines,  where  a 
large  number  of  NMOS  drain  connections  are  driven  by  a  single  device.  For  conventional  CMOS 
circuits, the inverter output is expected to be VDD when the input is low. However, the driver
output deviates from VDD when the high fan-out NMOS leakage current, Ioff, is non-negligible
with respect to Ion of the driving PMOS. The resulting low Ion/Ioff ratio degrades the high
output level (VOH) for high NMOS fan-in/out, and similarly, the low output level (VOL) cannot
reach VSS for high PMOS fan-in/out. This degradation of the circuit noise margins is increased by
process variation [15]. An analytical model is used here to provide guidelines for robust
subthreshold memory design.


B.  Analytical  Model 

As  the  fan-out  increases  or  VDD  decreases,  the  circuit  static  noise  margin  (SNM)  degrades  until 
the  circuit  fails.  A  gate  with  fan-out  driving  another  gate  without  fan-out,  referred  to  here  as  the 
driver  and  receiver,  respectively,  is  used  as  a  model  circuit.  Focus  here  is  on  high  NMOS  fan-out 
circuits  since  the  high  fan-in  circuit  is  equivalent,  and  since  PMOS  and  NMOS  have  symmetric 
characteristics.  Additionally,  common  circuits  such  as  RF  and  SRAM  have  fan-in/out  dominated 
by NMOS transistors. The leakage current in high NMOS fan-out helps provide $V_{OL}^{driver} < V_{IL}^{receiver}$.
Focus is thus on the driver VOH and receiver VIH values as they are affected by operating voltage
and fan-out.


In  [15],  the  driver  output  and  receiver  input  values  were  derived  as 


$$ V_{OH}^{driver} = V_{th} \cdot \ln\left( \cdots \right) \tag{3} $$

$$ K_0 = \cdots \exp\left( \frac{\cdots}{n V_{th}} \right) \tag{7} $$


and $\alpha = \exp(-V_{DD}/V_{th})$, $K = K_0/(N + 1)$. N is the fan-out at the circuit node in question, which
describes the total width of diffusion-connected gates normalized to the width of the driver.
When operating with VDD above threshold, the term $4K\alpha$ is negligible since $\alpha$ is vanishing,
giving $V_{OH}^{driver}$ a linear relationship with VDD. However, $V_{OH}^{driver}$ is non-linear with respect to VDD in
subthreshold operation. $V_{IH}^{receiver}$ has a linear relationship with VDD in both sub and above threshold
operation.

The  analytical  model  is  computationally  efficient  since  the  formulations  are  closed-form  and 
require  no  iteration  for  solution.  The  parameters  are  extracted  directly  from  transistor 
characteristics.  The  model  is  verified  by  comparison  with  circuit  simulation  on  the  target  130  nm 
process. Fig. 3 shows the comparison of receiver VIL and VIH between the analytical model and
circuit simulation. The error is less than 10% of VDD. VOL and VOH comparisons are also shown
in the figure, and the results agree to within 4% of VDD.

For correct operation, $V_{OH}^{driver}$ and $V_{IH}^{receiver}$ must satisfy

$$ V_{OH}^{driver} - V_{IH}^{receiver} > 0 \tag{8} $$

to provide positive static noise margin (SNM). Substituting $V_{OH}^{driver}$ and $V_{IH}^{receiver}$ in (8) yields the
maximum fan-out, Nmax, at a given operating voltage as


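$$ N_{max} = \frac{(1 - C_1^2)\,K_0}{2C_1 + 2 - 4\alpha} - 1 \tag{9} $$

where

$$ C_1 = 2\alpha \exp\left( \frac{V_{IH}^{receiver}}{V_{th}} \right) - 1 \tag{10} $$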
Solving (8) numerically yields the minimum operating supply voltage, Vmin, at a given fan-out N.
Fig. 4 shows the onset of correct operation with positive SNM as VDD increases at different
fan-out N. As expected from (3), the driver VOH varies non-linearly with VDD. The results also
demonstrate that the minimum supply voltage for correct operation Vmin increases as the fan-out
goes up. Fig. 4 is for the typical process. The fast-N slow-P corner of operation raises Vmin.
Consequently,  minimizing  the  fan-in/out,  while  achieving  acceptable  array  efficiency,  is 
imperative  for  subthreshold  memory  design. 
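
The trade-off between fan-out and supply voltage can be illustrated with a much simpler numerical sketch than the closed-form model. The Python code below is not the model of [15]. It assumes matched NMOS and PMOS devices that follow the ideal subthreshold current expression, ignores process variation, and treats the receiver VIH as a fixed fraction of VDD (the parameter `v_ih_fraction`, a hypothetical simplification). The driver VOH is found by bisection on the balance between the PMOS on current and the leakage of N + 1 NMOS device widths.

```python
import math

V_TH = 0.02585  # thermal voltage near room temperature, in volts


def driver_voh(v_dd, fanout, n=1.5):
    """High output level of a driver loaded by `fanout` leaking NMOS widths.

    Idealized: matched devices, no variation. Currents are normalized to the
    leakage of (fanout + 1) NMOS widths, so only the ratio k matters.
    """
    k = math.exp(v_dd / (n * V_TH)) / (fanout + 1)

    def net_current(v_out):
        pull_up = k * (1.0 - math.exp(-(v_dd - v_out) / V_TH))
        leakage = 1.0 - math.exp(-v_out / V_TH)
        return pull_up - leakage

    lo, hi = 0.0, v_dd  # net_current is positive at 0 and negative at v_dd
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if net_current(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def max_fanout(v_dd, v_ih_fraction=0.6, n=1.5, limit=4096):
    """Largest fan-out for which VOH stays at or above the assumed VIH."""
    best = 0
    for fanout in range(1, limit + 1):
        if driver_voh(v_dd, fanout, n) >= v_ih_fraction * v_dd:
            best = fanout
        else:
            break  # VOH falls monotonically with fan-out
    return best


if __name__ == "__main__":
    for v_dd in (0.15, 0.20, 0.25):
        print(v_dd, round(driver_voh(v_dd, 16), 4), max_fanout(v_dd))
```

The absolute fan-out limits from such an idealized model are optimistic, since skewed corners and random variation are left out. Only the trend is meaningful: VOH degrades as the fan-out grows, and the tolerable fan-out shrinks rapidly as VDD is lowered.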

III. CIRCUIT DESIGN AND OPERATION

The  512  x  13b  memory  is  implemented  hierarchically,  with  sub-banks  each  having  32  rows  of 
13  bit  words  (12  bit  words  plus  one  parity  bit).  There  are  four  memory  banks,  each  comprised  of 
four  sub-banks  arranged  horizontally.  The  physical  layout  is  described  in  detail  in  section  IV. 
Hence,  each  bank  has  32  global  WL’s,  each  driving  local  WL  drivers.  Each  sub-bank  WL  driver 
is  locally  gated.  The  local  BL’s  (those  within  each  sub-bank)  have  16  cells  above  and  16  cells 
below  the  read  and  write  circuits.  The  maximum  circuit  WBL  fan-out  is  thus  16,  to  minimize  the 
required  operating  voltage,  while  still  using  reasonably  conventional  circuits.  The  RBL  fan-in  is 
8  to  limit  the  read  dynamic  power  consumption  and  fan-in  as  described  below.  The  local  RBL’s 
drive the global RBL’s, which run the height of the array. The array is organized with four
sub-banks per global BL, with their outputs multiplexed at the bottom of the array.
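
The hierarchy described above (4 banks x 4 sub-banks x 32 rows = 512 words) can be sketched as an address decomposition. The bit ordering of the address fields and the convention that rows 0 to 15 sit above the local read and write circuits are assumptions made for this illustration; the text does not specify the actual address mapping.

```python
from typing import NamedTuple

BANKS = 4
SUB_BANKS_PER_BANK = 4
ROWS_PER_SUB_BANK = 32
ROWS_PER_HALF = 16  # 16 cells above and 16 below the local read/write circuits
WORDS = BANKS * SUB_BANKS_PER_BANK * ROWS_PER_SUB_BANK


class WordLocation(NamedTuple):
    bank: int
    sub_bank: int
    row: int
    upper_half: bool


def locate(address: int) -> WordLocation:
    """Split a word address into bank, sub-bank and row (hypothetical mapping)."""
    if not 0 <= address < WORDS:
        raise ValueError(f"address must be in [0, {WORDS})")
    row = address % ROWS_PER_SUB_BANK
    sub_bank = (address // ROWS_PER_SUB_BANK) % SUB_BANKS_PER_BANK
    bank = address // (ROWS_PER_SUB_BANK * SUB_BANKS_PER_BANK)
    return WordLocation(bank, sub_bank, row, row < ROWS_PER_HALF)
```

Under this mapping each local bit line segment sees 16 cells, which corresponds to the WBL fan-out of 16 stated above.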

The chip core circuits operate at a nominal VDD of 1.2 V while the chip level I/O circuits
operate at 1.5 to 2.5 V. The former utilize conventional level shifting circuits [24-26]. The
subthreshold memory circuit logic levels are obviously incompatible with the core voltage. A
specially developed level shifter is used to convert the memory output data to the chip core
circuit voltage. This level shifter differs from the conventional design in its ability to convert
signals from subthreshold to above threshold with a reasonable NMOS to PMOS size ratio [27].

A.  Memory  Cell 

The  traditional  RF  cell  (see  Fig.  5)  writes  by  pulling  down  on  one  side  or  the  other  when  the 
write  word  line  (WWL)  is  asserted  high,  depending  on  the  write  bit  line  (WBL)  state.  The  node 
to  be  written  is  a  ratioed  circuit,  where  the  input  pull  down  transistors  must  overcome  the  PMOS 
feedback  device  to  write.  In  the  conventional  RF  cell  design,  transistors  N3,  N4  and  N5  are  sized 
to  provide  adequate  strength  to  overcome  the  PMOS  devices  across  process,  voltage,  and 
temperature  (PVT)  corners.  In  subthreshold,  the  variation  effects  are  larger,  and  sizing  may  be 
inadequate. 

Ideally,  the  latch  feedback  (FB)  loop  is  open  during  the  write  operation,  and  closed  when  the 
write  is  complete.  This  is  accomplished  by  gating  one  of  the  feedback  path  transistors,  Pgate,  as 
shown  in  Fig.  5,  which  improves  the  write  margin  in  subthreshold.  The  WWL  controls  the  single 
additional  PMOS  transistor.  The  feedback  inverter  PMOS  helps  to  raise  the  voltage  at  node  XN 
when  that  node  is  being  written  to  a  logic  “1”.  A  simulated  write  operation  is  shown  in  Fig.  6 
with VDD = 180 mV. At the rising edge of the WWL, the write transistor drives the output storage
node.  The  input  node  transitions  slowly  due  to  the  open  FB  loop.  When  the  loop  is  closed  by  the 
WWL  the  cell  value  is  quickly  updated. 

Fig.  5  has  the  same  read  out  circuit  as  a  conventional  RF  cell,  i.e.,  a  two  transistor  stack  as 
mentioned in Section I(B). A typical conventional register file has 16 or more pull-down NMOS
transistors connected to the RBL [20,22]. This fan-in limits the memory Vmin. To reduce the RBL
fan-in, the outputs of two storage cells drive a single stage complex CMOS gate to share one pull
down NMOS transistor as shown in Fig. 7. This scheme is similar to the single pull-down per RF
cell used in [28]. The single pull down transistor on the RBL reduces the diffusion capacitance as
well  as  the  circuit  fan-out.  Since  there  is  no  stack,  the  transistors  can  be  half  as  wide  as  in  a 
conventional  design.  When  combined  with  the  reduced  fan-in,  the  overall  RBL  capacitance 
savings  over  the  conventional  design  approaches  4x,  reducing  both  delay  and  RBL  contribution 
to  the  dynamic  power  dissipation  by  that  amount.  The  capacitive  load  of  the  RWL  is  also 
reduced,  which  improves  speed.  The  single-stage  complex  merge  gate  is  implemented  using 
minimum  sized  transistors,  which  adds  a  small  delay,  but  improves  the  speed  by  a  greater  amount 
due  to  the  reduction  in  the  RWL  capacitive  load.  In  a  conventional  RF  array,  the  RWL  driver  is 
not located close to the RBL pull down network, i.e., this domino pre-charged circuit has a
non-local ground, which can increase noise at the domino (input) read transistors. In the cell used
here,  the  static  gate  provides  a  local  ground,  increasing  noise  immunity  as  well  as  combining  the 
cell  outputs. 
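
The approximate 4x figure follows from two factors of two. As a back-of-the-envelope check, assume that the RBL diffusion capacitance scales with the number of pull-down devices times their width W, and take 16 pull-downs of width W as the conventional baseline (both are simplifying assumptions):

$$ \frac{C_{RBL,\,conventional}}{C_{RBL,\,shared}} \approx \frac{16 \cdot W}{8 \cdot (W/2)} = 4 $$

Any capacitance that does not scale with device count or width, such as the bit line wire itself, would presumably keep the realized savings somewhat below this ideal ratio.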

B.  Read  Circuits 

The  RBL  is  a  precharge/discharge  domino  circuit.  A  PMOS  keeper  is  usually  used  to  prevent 
RBL  discharge  due  to  leakage,  in  the  case  when  the  read-out  bit  is  logic  one.  In  the  conventional 
domino  keeper  design,  the  RBL  has  high  fan-in,  which  produces  a  large  leakage  current,  IN.  A 
strong  keeper  can  tolerate  greater  leakage  current  but  can  make  the  RBL  un-writable  by  the  cell, 
since  the  RBL  is  a  ratioed  circuit  until  the  keeper  is  turned  off.  The  situation  is  analogous  to  the 
write-ability  issues  outlined  for  the  cell  above.  The  ideal  keeper  size  should  be  variable,  and  this 
problem  has  been  addressed  extensively  [29-30].  The  specific  solution  used  here  is  to  gate  the 
keeper  off  during  the  read,  but  to  turn  it  on  once  the  read  is  complete.  This  provides  no 
contention during the read, provides pseudo-static operation, and robustness against leakage
discharging the RBL.

Replica timing was ruled out, based on high circuit variability at low voltages. Instead, a
self-timed circuit, which relies on cells attached to the same RWL to generate a timing signal to
control the keeper, is used. This control signal turns the keeper on or off at a time determined by
the RBL status during the evaluation phase. An additional PMOS transistor, Pctrl, is used to gate
the keeper transistor as shown in Fig. 8. The keeper is turned on or off depending on both the
RBL  status  and  the  keeper  control  signal.  One  bit  cell  is  read  differentially,  and  the  other  12  bits 
have  single-ended  readout.  The  differential  cell  has  both  RBL  and  its  complement  output  RBLN. 
Both  RBL  and  RBLN  are  high  during  the  precharge  phase.  During  the  evaluation  phase,  one  is 
discharged  low  depending  on  the  value  read  out. 

A  differential  cell  does  not  require  a  keeper  control,  since  each  of  the  cross-coupled  keepers  is 
activated by the complementary bit line. The local keeper control signal LKC (or global GKC) is
generated  by  a  NAND  gate  that  detects  either  RBL  or  RBLN  transitioning  low  (see  Fig.  8).  Since 
the  differential  and  single-ended  RBL’s  have  nominally  identical  delays,  the  single  ended  RBL 
keepers  are  turned  on  two  gate  delays  after  a  read  is  completed.  Every  sub-bank  has  its  own 
differential  cells,  so  that  variation  is  localized.  This  scheme  is  extended  to  the  PMOS  keepers  on 
the  global  RBL’s  as  well.  The  simulated  read  and  keeper  operation  is  illustrated  in  Fig.  9.  The 
differential  local  RBL’s  labeled  LRBL1  and  LRBLN1,  activate  the  local  keeper  control  signal 
LKC as shown. The low assertion turns on the gating PMOS transistor Pctrl in Fig. 8. The keeper
delay relative to the self-timing signals allows buffering time and provides timing margin.
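
The keeper gating can be summarized with a small behavioral sketch. It is a logic-level illustration only, under two assumptions that the text does not state explicitly: the NAND detector is followed by an inverter, so that LKC is active low and arrives two gate delays after the differential read resolves, and each single-ended keeper is also gated by its own bit line, as in an ordinary domino keeper.

```python
def keeper_control(rbl: bool, rbln: bool) -> bool:
    """Active-low keeper control (LKC) derived from the differential bit lines."""
    detect = not (rbl and rbln)  # NAND: high once RBL or RBLN has fallen
    return not detect            # assumed inverter; low enables the keepers


def keeper_conducts(single_ended_rbl: bool, lkc: bool) -> bool:
    """Keeper path conducts only if the gating PMOS is on and the RBL is high."""
    gating_pmos_on = not lkc
    return gating_pmos_on and single_ended_rbl


# Precharge: both differential bit lines are high, so the keepers stay off.
assert keeper_control(True, True) is True
assert keeper_conducts(True, keeper_control(True, True)) is False

# After evaluation: one differential bit line has fallen, so LKC goes low.
lkc = keeper_control(False, True)
assert keeper_conducts(True, lkc) is True    # undischarged RBL is now held high
assert keeper_conducts(False, lkc) is False  # discharged RBL is left low
```

In this picture there is no keeper contention while a single-ended bit line is discharging, and a bit line that stays high is held once the differential column signals that the read has completed.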

C.  Address  Decode 

The  poor  drive  current  of  transistors  in  subthreshold  mandates  multiple  levels  of  decode  to 
balance  the  driven  capacitances  across  multiple  gate  stages.  Each  sub-bank  has  a  local  WL  driver 
located  at  its  center.  Only  the  WL’s  of  the  selected  sub-bank  are  ever  active,  limiting  total  active 
power dissipation. The global decoder is located at the center of the memory array. The GWL
generated  by  the  global  decoder  asserts  the  bank  local  word  line  drivers.  The  global  decoder  and 
GWL  drivers  are  kept  small  by  reducing  the  load  driven,  through  the  hierarchical  WL’s  and 
optimizing  the  logic  levels  in  the  decoder.  The  decoders  use  static  CMOS  logic  exclusively.  The 
decoder  setup  time  is  comparable  to  that  required  to  read  out  the  array  after  the  clock  activates 
the  GWL. 

D.  Operation  for  Low  Power 

As  mentioned,  only  one  sub-bank  is  active  in  any  single  operation  to  minimize  active  power 
dissipation.  The  unselected  sub-bank  RBL’s  remain  in  the  precharged  state,  which  also  keeps 
them  from  activating  the  hierarchical  global  read  BL’s.  The  active  RBL  remains  high  if  the  data 
read  is  “0”,  otherwise,  it  is  discharged.  The  top  and  bottom  local  RBL’s  are  multiplexed  by 
logically  NANDing  them  so  no  multiplexer  select  signal  is  required.  The  simulated  read 
operation  is  shown  in  Fig.  10.  The  local  upper  RBL  (LURBL)  is  discharged  after  the  RWL  is 
enabled.  The  local  I/O  circuit  receives  the  RBL  value  and,  depending  on  the  value,  asserts  the 
gate  of  a  pull  down  NMOS  transistor  (see  Fig.  8)  to  discharge  the  global  RBL  (GRBL). 
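
The select-free multiplexing works because an unselected local RBL always stays precharged high. A minimal logic sketch follows; it assumes, for illustration, that the NAND output directly drives the gate of the global pull-down NMOS, whereas the actual local I/O circuit may contain additional stages.

```python
def merge_local_rbls(upper_rbl: bool, lower_rbl: bool) -> bool:
    """NAND of the upper and lower local read bit lines."""
    return not (upper_rbl and lower_rbl)


def global_rbl_after_read(upper_rbl: bool, lower_rbl: bool) -> bool:
    """State of the precharged global RBL after the local read completes."""
    pull_down_on = merge_local_rbls(upper_rbl, lower_rbl)
    return not pull_down_on  # GRBL is discharged when the pull-down is on


# Neither half discharged: the global bit line stays high.
assert global_rbl_after_read(True, True) is True
# Either half discharged: the global bit line is pulled low.
assert global_rbl_after_read(False, True) is False
assert global_rbl_after_read(True, False) is False
```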

IV.  PHYSICAL  DESIGN 

The  design  is  implemented  in  a  130  nm  process  technology.  To  minimize  the  time  delay  due  to 
wire  capacitance  and  overall  power  consumption,  minimum  area  was  a  goal  of  the  design.  The 
differential and single ended cells have the same height of 4.74 μm as all the cells in one row
share  both  WWL  and  RWL  signals.  Fig.  11  shows  the  sub-bank  floor  plan.  The  local  decoder  is 
located at the center. The differential column is located next to the decoder, to minimize the
distance and hence maximize the matching to the single-ended read out columns. It was
constructed by adding a second read circuit to the other side of the single-ended cell. The local
I/O circuit is placed at the center, to minimize the RBL capacitance by splitting 16 cells above
and 16 below. The sub-bank is 118.5 x 82.5 μm². The overall memory layout is shown in Fig. 12.
The total layout area is 520 x 480 μm². Access to the memory is via scan chains to limit the
required  I/O  count.  A  block  diagram  of  the  memory  partitioning  into  banks  and  sub-banks 
comprises  Fig.  13. 

V.  MEASURED  RESULTS 

The  fabricated  memory  was  measured  at  room  temperature  at  different  power  supply  voltages 
and  operating  frequencies.  Functionality,  maximum  speed  vs.  voltage,  and  power  consumption  at 
different  frequencies  and  VDD  were  determined.  In  the  experimental  setup,  a  Perl  program  is  used 
to drive an FPGA-based test board. The FPGA generates stimulus to drive the test chip. The
output  can  be  either  observed  by  a  scope  directly  or  returned  to  the  test  board  FPGA  and 
transferred  to  a  PC  through  an  RS-232  port.  In  this  manner,  slow  functional  testing  can  be 
completely  driven  by  software,  without  reprogramming  the  FPGA.  A  separate  pin  is  used  to 
supply  power  to  the  memory  core,  allowing  the  power  consumption  to  be  measured  directly  and 
without  contribution  from  the  test  access  and  scan  circuits. 

A.  Functional  Test 

For  functional  test,  the  target  address  and  data  are  loaded  into  the  scan  chains  and  written  into 
the  memory.  After  loading  the  read  address  into  the  scan  chain,  it  is  transferred  to  the  decoder 
and  enables  the  mapped  word  in  the  selected  sub-bank.  The  read-out  bits  are  sent  to  the  memory 
I/O  circuits  in  parallel.  At  this  point,  the  subthreshold  logic  values  are  level  shifted  to  the  core 
voltage.  The  scan  chain  then  shifts  out  the  data  in  serial  fashion.  A  few  pins  are  brought  out 
directly to facilitate speed testing. Representative measured operational waveforms comprise Fig.
14.  Here,  Gclk  is  the  memory  global  clock.  The  scan  chain  clock  (Scan_clk)  is  asserted  to 
capture  the  values  and  shift  the  data  in  and  out.  The  shift  register  operation,  i.e.,  parallel  load  or 
serial  shift,  is  controlled  by  the  scan  enable  signal.  The  scan  clock  and  serial  data  both  entering 
and  leaving  the  IC  pins  are  evident. 

B.  Minimum  Operational  Voltage 

The  subthreshold  memory  has  a  minimum  theoretical  operational  voltage  (Vmin)  with  the  given 
fan-out/in  of  16  and  8  for  the  WBL  and  RBL,  respectively,  as  described  in  Section  II.  Above  this 
point,  the  memory  has  positive  SNM  for  VDD  down  to  about  180  mV  based  on  the  model.  To 
determine  Vmin  for  data  read  operation,  data  was  written  into  the  memory  cells  in  the  memory  at 
high  voltage,  and  then  read  out  repeatedly,  while  adjusting  the  memory  voltage  from  high  to  low. 
When  the  read  out  data  differs  from  that  written,  the  resulting  Vmin  was  recorded.  The  sub-bank 
failure  voltage  bitmap  for  read  operation  is  shown  in  Fig.  15(a).  The  failure  voltage  is  mostly 
distributed  randomly,  presumably  due  to  random  process  variations,  but  some  row  and  column 
effects  are  evident.  The  highest  Vmin  is  190  mV  and  the  lowest  cell  failure  operation  voltage  is 
103 mV. The highest Vmin is somewhat higher than the simulation result of 160 mV, but Vmin’s
are  distributed  around  that  value. 

For  write  operation,  data  is  written  into  cells  at  different  voltages  from  high  to  low.  The  data  is 
then  read  out  at  a  high  voltage.  When  the  read  out  data  is  different  from  that  written,  the  write 
operation  voltage  is  recorded  as  the  writing  failure  voltage  for  the  cell.  The  measured  results  are 
shown  in  Fig.  15(b).  The  highest  write  operation  failure  voltage  is  216  mV,  and  the  lowest  is  129 
mV. Both are higher than those of the read operation. This was expected, since the WBL has
fan-out of 16, which is higher than the fan-in of 8 on the RBL.

The measured distributions of write failure voltages have a higher deviation than that of the
read  operation.  The  most  likely  failure  voltage  for  write  operation  is  160  mV,  which  is  higher 
than  the  median  of  150  mV  for  the  read  operation.  Table  1  shows  the  measured  write  and  read 
minimum  operational  supply  voltages.  In  no  case  was  the  measured  minimum  data  retention 
voltage  limiting — the  cells  retain  state  at  lower  power  supply  voltages  than  required  for  reads  and 
writes.  The  effects  of  systematic  and  random  variations  on  the  theoretical  Vmin  are  described  in 
more  detail  in  [15].  In  a  product  operating  in  subthreshold,  the  necessary  supply  voltage  guard 
band  would  depend  on  the  specific  process  variability,  yield,  and  applications. 
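
The read Vmin procedure described in this section amounts to a high-to-low voltage sweep per word. The sketch below shows the idea in Python; `read_word` stands in for the scan-chain access through the test board and is hypothetical, as is the toy device model used to exercise the sweep.

```python
from typing import Callable, Iterable, Optional


def find_read_failure_voltage(
    read_word: Callable[[int], int],
    expected: int,
    voltages_mv: Iterable[int],
) -> Optional[int]:
    """Sweep the supply from high to low and return the first failing voltage.

    Returns None if the word reads back correctly at every voltage tried.
    """
    for v_mv in sorted(voltages_mv, reverse=True):
        if read_word(v_mv) != expected:
            return v_mv
    return None


def toy_word(v_mv: int) -> int:
    """Toy stand-in for one 13-bit word: reads correctly down to 150 mV."""
    stored = 0x1ABC
    return stored if v_mv >= 150 else stored ^ 0x0001  # one bit flips below 150 mV


failure_mv = find_read_failure_voltage(toy_word, 0x1ABC, range(300, 90, -10))
```

For the write test the roles are swapped: the write is performed at the swept voltage and the read-back is done at a high voltage, so the same loop applies with a different access function.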

C.  Performance  Measurement 

The  memory  achieved  a  maximum  operating  frequency  of  28  kHz  at  VDD  =  190  mV  measured 
at  room  temperature  as  shown  in  Fig.  16.  The  maximum  operating  frequency  (Fmax)  is  2  MHz  at 
VDD  =  325  mV.  As  expected,  the  measured  results  show  speed  reduces  exponentially  when  the 
power  supply  voltage  scales  down  in  the  subthreshold  mode.  Of  course,  the  memory  is  capable  of 
operating  above  threshold,  and  was  tested  to  be  functional  up  to  1.2  V.  The  measured  critical 
path  is  through  the  decoder,  which  limits  the  maximum  operating  frequency  for  a  balanced  clock 
duty  cycle. 
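
The exponential speed dependence can be checked against the two operating points quoted above (28 kHz at 190 mV and 2 MHz at 325 mV). The sketch below fits a single exponential through these two points. Treating Fmax as purely exponential in VDD over this range is an assumption made for illustration, not a fit to the full data of Fig. 16.

```python
import math


def fit_exponential_speed(v1_mv, f1_hz, v2_mv, f2_hz):
    """Fit f(V) = f1 * 10**((V - v1) / slope) through two measured points."""
    slope_mv_per_decade = (v2_mv - v1_mv) / math.log10(f2_hz / f1_hz)

    def fmax(v_mv):
        return f1_hz * 10 ** ((v_mv - v1_mv) / slope_mv_per_decade)

    return slope_mv_per_decade, fmax


slope_mv, fmax_at = fit_exponential_speed(190, 28e3, 325, 2e6)

if __name__ == "__main__":
    print(f"effective slope: {slope_mv:.1f} mV per decade of frequency")
    print(f"interpolated Fmax at 310 mV: {fmax_at(310) / 1e6:.2f} MHz")
```

The interpolated value at 310 mV comes out a little above 1 MHz, in line with the 1 MHz, 310 mV operating point reported for the power measurements.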

D.  Measured  Power 

The  total  power  includes  both  static  power  and  dynamic  power  as 
$$ P_{total} = P_{static} + P_{dynamic} = P_{static} + \alpha C V^2 f \tag{11} $$

where f is the operating frequency and α is the circuit activity factor. The static power is due to
transistor leakage, which is reduced as VDD scales down. Fig. 17 shows the measured test chip
speed  at  different  supply  voltages  in  the  subthreshold  region  of  operation.  The  power 
consumption, including the leakage, dynamic, and total power components, was measured with
the memory clocked at its maximum operational frequencies for each supply voltage, as
determined by the failure point. The passing points are shown. The test chip consumes 1.197 μW
of power, which includes 0.366 μW of leakage power and 0.831 μW of dynamic power
consumption at 1 MHz clock rate and 310 mV power supply. At higher supply voltages, the
memory  circuits  are  faster  and  the  total  power  consumption  is  increasingly  dominated  by  the 
dynamic component. The foundry process used for fabrication exhibits relatively low leakage and
has low DIBL, as is evident from the measurements. It also has negligible gate leakage current.
For processes with high gate leakage, e.g., sub-130 nm processes, the gate leakage component
will be greatly reduced by the low VDD. At VDD < 270 mV the leakage power exceeds the dynamic power. The measured
dynamic  power  exhibits  the  expected  quadratic  relationship  with  the  power  supply  voltage.  Fig. 
18  shows  the  power  is  linear  with  frequency  at  a  fixed  voltage,  also  as  expected.  The 
measurements, taken at three VDD values, have different y-axis intercepts due to the voltage
dependence  of  the  leakage  currents. 
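
To make the dynamic term concrete, the measured 310 mV, 1 MHz operating point can be inverted for the lumped product of activity factor and capacitance. This is a back-of-the-envelope sketch: it assumes the reported dynamic power follows the quadratic-in-voltage, linear-in-frequency form exactly and that leakage power is independent of frequency at a fixed voltage. The names are illustrative.

```python
# Measured operating point quoted in the text.
VDD = 0.310          # supply voltage, V
FREQ = 1e6           # clock rate, Hz
P_LEAK = 0.366e-6    # leakage power, W
P_DYN = 0.831e-6     # dynamic power, W

# Invert P_dyn = alpha * C * VDD^2 * f for the lumped product alpha*C.
ALPHA_C = P_DYN / (VDD ** 2 * FREQ)   # farads


def total_power(freq_hz):
    """Total power (W) at VDD = 310 mV: constant leakage plus a term linear in f."""
    return P_LEAK + ALPHA_C * VDD ** 2 * freq_hz


if __name__ == "__main__":
    print(f"alpha*C = {ALPHA_C * 1e12:.2f} pF")
    for f in (0.0, 0.5e6, 1e6):
        print(f"f = {f / 1e6:.1f} MHz -> P = {total_power(f) * 1e6:.3f} uW")
```

The effective switched capacitance comes out at roughly 8.6 pF. At zero frequency the model returns the leakage power alone, which is the y-axis intercept behaviour described for Fig. 18.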

E.  Energy  per  Operation 

A  circuit’s  efficiency  is  usually  defined  by  the  energy  consumed  for  each  operation,  i.e., 

$$E = P_{tot} / f \qquad (12)$$

This  figure  of  merit  is  interesting  as  it  can  define  the  point  where  the  most  computation  can  be 
performed  at  the  least  total  energy,  assuming  that  time  to  complete  the  computation  is  not  a 
constraint.  This  minimum  energy  operating  point  is  thus  important  for  systems  that  must 
maximize  battery  lifetime  in  lieu  of  other  constraints.  Fig.  19  shows  the  subthreshold  memory 
energy per operation at a number of voltages. Again, the memory is operated at Fmax for each
voltage.

The total energy consumption per operation includes both the leakage and dynamic
components. The figure shows that the leakage component is exponentially related to the supply
voltage. The dynamic component is $\alpha C V_{DD}^{2}$. Of course, α is very low in the memory, given that
one in 128 GWLs is asserted and one in eight GBLs is discharged per cycle, on average. This
makes the leakage component more important in memories than in logic. The sum of the two
opposing monotonic curves describing the leakage and active components reaches its minimum
at a VDD below 350 mV.
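
The trade-off between the two components can be sketched numerically. The example below first applies Eq. (12) to the measured 310 mV, 1 MHz point, then sweeps a simple model over VDD. The model is an assumption made for illustration, not the authors' analysis: Fmax is taken as exponential in VDD through the two quoted speed points, the leakage current is held constant at the value implied by the 310 mV measurement, and the product of activity factor and capacitance is held constant.

```python
import math

# Measured values quoted in the text.
V_LO, F_LO = 0.190, 28e3      # Fmax point: 28 kHz at 190 mV
V_HI, F_HI = 0.325, 2e6       # Fmax point: 2 MHz at 325 mV
V_PWR, F_PWR = 0.310, 1e6     # power measurement point
P_LEAK, P_DYN = 0.366e-6, 0.831e-6   # watts at the power measurement point

# Assumptions (not from the paper): Fmax is exponential in VDD through the two
# speed points, leakage current does not vary with VDD, and alpha*C is constant.
DECADES_PER_VOLT = math.log10(F_HI / F_LO) / (V_HI - V_LO)
I_LEAK = P_LEAK / V_PWR                   # amperes
ALPHA_C = P_DYN / (V_PWR ** 2 * F_PWR)    # farads


def energy_per_operation(p_total_w, freq_hz):
    """Eq. (12): energy per operation (J) from total power and frequency."""
    return p_total_w / freq_hz


def fmax(vdd):
    """Assumed exponential Fmax (Hz) at supply voltage vdd (V)."""
    return F_LO * 10 ** (DECADES_PER_VOLT * (vdd - V_LO))


def energy_components(vdd):
    """Modelled (leakage, dynamic) energy per operation in joules at Fmax(vdd)."""
    e_leak = energy_per_operation(I_LEAK * vdd, fmax(vdd))
    e_dyn = ALPHA_C * vdd ** 2
    return e_leak, e_dyn


def minimum_energy_vdd(v_start=0.190, v_stop=0.400, step=0.001):
    """Grid-search the VDD (V) with the lowest modelled total energy."""
    n = int(round((v_stop - v_start) / step))
    grid = [v_start + i * step for i in range(n + 1)]
    return min(grid, key=lambda v: sum(energy_components(v)))


if __name__ == "__main__":
    e_meas = energy_per_operation(P_LEAK + P_DYN, F_PWR)
    print(f"measured point (310 mV, 1 MHz): {e_meas * 1e12:.3f} pJ per operation")
    v_min = minimum_energy_vdd()
    e_leak, e_dyn = energy_components(v_min)
    print(f"model minimum near {v_min * 1e3:.0f} mV: "
          f"leakage {e_leak * 1e12:.2f} pJ + dynamic {e_dyn * 1e12:.2f} pJ")
```

Under these assumptions the modelled minimum falls below 350 mV, with leakage dominating at the low end of the sweep and dynamic energy dominating at the high end. The exact location depends on the assumed speed and leakage models and should not be read as a measured value.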

VI.  CONCLUSIONS 

Operating  with  the  supply  voltage  below  the  threshold  voltage  is  the  most  effective  method  to 
produce  circuits  for  ultra-low  power  applications.  However,  such  low  voltages  create  difficulties, 
particularly  for  memory  design,  since  the  ratio  of  Ion/Ioff  is  greatly  reduced  in  high  fan-in/out 
circuits such as bit lines. The increased sensitivity to PVT variations that subthreshold circuits
exhibit  will  require  substantial  design  margin  to  obtain  high  yields.  Here,  lowering  the  minimum 
operation  voltage  as  much  as  possible  by  design  is  required  to  maximize  the  design  margin  while 
still  operating  at  very  low  VDD.  An  analytical  model  to  determine  the  onset  of  positive  noise 
margin  with  respect  to  VDD  and  circuit  fan-out  was  outlined  and  used  as  the  basis  for  the  memory 
design. 

Subthreshold memory design requires unconventional design approaches. A number of
applicable circuit and micro-architecture level techniques have been described here to reduce
fan-in/out and address the poor drive currents afforded in subthreshold. These include hierarchical
memory organization, reduced fan-in by combining cell outputs, and self-timed keeper controls.
The self-timed scheme allows extensive use of dynamic circuits while allowing safe, pseudo-static
operation at extremely low operating voltages. The techniques are also suitable for above-threshold
supply voltages where circuits exhibit high leakage, such as very high performance or
high operating temperature circuits. These techniques have enabled a memory with dynamic read and
relatively high density, in contrast to previous single power supply subthreshold approaches [2].
The use of multiple memory supply voltages has been proposed to limit leakage power with a
subthreshold voltage, while avoiding stability compromise by reading at higher voltages, using
conventional 6-T SRAM cells [flaunter] and has been shown to achieve higher density
[reviewer].

A 512 x 13b subthreshold memory fabricated on a 130-nm process technology was tested to be
fully functional for read operation at 190 mV with a 28 kHz clock frequency. The speed and array efficiency are much
improved over those of the subthreshold memory design presented in [2]. Single bits can work as
low as 129 mV. The memory achieves a 1 MHz clock rate with a 310 mV power supply, and
consumes 1.197 µW at that voltage. The memory consumes 1 pJ of energy per operation in
laboratory measurements at room temperature, or less than 77 fJ of energy per bit, at a 345 mV
supply voltage.

Fig. 1. MOS transistor IDS vs. VGS.


Fig. 2. SNM comparison of SRAM and RF at VDD = 200 mV.

Fig. 3. Comparison between simulation and analytical model for receiver.

Fig. 4. Different fan-out circuits SNM vs. VDD. The minimum operation voltage, Vmin, is the VDD where the driver VOH exceeds the receiver VIH.

Fig. 5. RF cell with conventional read circuit. An additional transistor, Pgate, is added to aid write margin in subthreshold.

Fig. 6. Simulated write operation waveform.

[Figure axis label: Receiver VIH (mV)]

Fig. 8. Subthreshold memory structure.

Fig. 9. Simulated keeper control signal waveform.

Fig. 10. Simulated read operation waveform at VDD = 200 mV.

Fig. 11. Sub-bank layout.

Fig. 12. 512 x 13 bits subthreshold memory layout.

Fig. 13. Block diagram of memory structure.

Fig. 14. Functionality test.

Fig. 15. Measured sub-bank failure voltage bitmap of (a) read operation and (b) write operation.

Fig. 16. Measured test chip speed vs. VDD.

Fig. 17. Measured test chip power consumption in subthreshold mode.

Fig. 18. Measured total power consumption vs. operation frequency.

Fig. 19. Measured test chip energy consumption per operation.

Table 1. Comparison of measured results of read and write operation.

Operation   [...]   [...] failure voltage (mV)   [...] failure voltage (mV)   [...] failure voltage (mV)
Write       16      216                          129                          160
Read        8       190                          103                          150

